Credit Card Users Churn Prediction
Problem Statement
Business Context
The Thera bank recently saw a steep decline in the number of users of their credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.
Customers leaving its credit card services would lead the bank to a loss, so the bank wants to analyze customer data, identify the customers who will leave its credit card services, and understand the reasons why, so that the bank can improve in those areas.
You, as a data scientist at Thera bank, need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
Data Description
- CLIENTNUM: Client number. Unique identifier for the customer holding the account
- Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
- Customer_Age: Age in Years
- Gender: Gender of the account holder
- Dependent_count: Number of dependents
- Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
- Marital_Status: Marital Status of the account holder
- Income_Category: Annual Income Category of the account holder
- Card_Category: Type of Card
- Months_on_book: Period of relationship with the bank (in months)
- Total_Relationship_Count: Total no. of products held by the customer
- Months_Inactive_12_mon: No. of months inactive in the last 12 months
- Contacts_Count_12_mon: No. of Contacts in the last 12 months
- Credit_Limit: Credit Limit on the Credit Card
- Total_Revolving_Bal: Total Revolving Balance on the Credit Card
- Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
- Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
- Total_Trans_Amt: Total Transaction Amount (Last 12 months)
- Total_Trans_Ct: Total Transaction Count (Last 12 months)
- Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
- Avg_Utilization_Ratio: Average Card Utilization Ratio
What Is a Revolving Balance?
- If we don't pay the balance of the revolving credit account in full every month, the unpaid portion carries over to the next month. That's called a revolving balance
What is the Average Open to Buy?
- 'Open to Buy' means the amount left on your credit card to use. Now, this column represents the average of this value for the last 12 months.
What is the Average Utilization Ratio?
- The Avg_Utilization_Ratio represents how much of the available credit the customer spent. This is useful for calculating credit scores.
Relation b/w Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:
- ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
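This identity follows from Avg_Open_To_Buy = Credit_Limit − (average revolving balance) and Avg_Utilization_Ratio = (average revolving balance) / Credit_Limit. A minimal numeric check, using illustrative values rather than rows from the dataset:

```python
# Check ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio == 1
# using illustrative numbers (not taken from the dataset).
credit_limit = 12691.0
avg_revolving_bal = 777.0

avg_open_to_buy = credit_limit - avg_revolving_bal          # credit left to use
avg_utilization_ratio = avg_revolving_bal / credit_limit    # share of credit used

total = avg_open_to_buy / credit_limit + avg_utilization_ratio
print(round(total, 10))  # 1.0
```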
Please read the instructions carefully before starting the project.
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
- Blanks '_______' are provided in the notebook that need to be filled with appropriate code to get the correct result. With every '_______' blank, there is a comment that briefly describes what needs to be filled in.
- Identify the task to be performed correctly, and only then proceed to write the required code.
- Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
- Please run the code cells sequentially from the beginning to avoid unnecessary errors.
- Add the results/observations (wherever mentioned) derived from the analysis in the presentation and submit the same.
Importing necessary libraries
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
!pip install --upgrade -q threadpoolctl
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# from sklearn.tree import DecisionTreeRegressor
# from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, StackingRegressor
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
# from xgboost import XGBRegressor  # regression variant; XGBClassifier is imported below
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import tree # for DecisionTreeClassifier
# To tune model, get different metric scores, and split data
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay
)
# To impute missing values
from sklearn.impute import SimpleImputer
from sklearn import metrics
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To supress warnings
import warnings
warnings.filterwarnings("ignore")
Loading the dataset
#Loading dataset
data=pd.read_csv("BankChurners.csv")
Data Overview
- Observations
- Sanity checks
View the first five rows of the dataset
data.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
Check data types and the number of non-null values for each column
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
Observation

- There are 21 columns and 10127 rows of data.
- The `Education_Level` and `Marital_Status` columns have missing data.
- There are several object-type columns: `Attrition_Flag`, `Gender`, `Education_Level`, `Marital_Status`, `Income_Category`, `Card_Category`.
- `CLIENTNUM` represents a key and is not numerically significant.

Next step: the non-null counts show that `Education_Level` and `Marital_Status` contain missing values. We can confirm this using the isna() method.
data.isna().sum()
CLIENTNUM 0 Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Observation
As expected:

- `Education_Level` is missing in about 15% of the rows
- `Marital_Status` is missing in about 7% of the rows
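The percentages above come from dividing the null counts by the row count; a small sketch on a toy frame (the column names match the dataset, the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`, with made-up values
toy = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "College", np.nan],
    "Marital_Status": ["Married", "Single", np.nan, "Married"],
})

missing_pct = toy.isna().mean() * 100  # percent of missing rows per column
print(missing_pct["Education_Level"])  # 50.0
print(missing_pct["Marital_Status"])   # 25.0
```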
Summary of the dataset
# Summary of the continuous columns
data[['Customer_Age','Dependent_count','Months_on_book','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon','Credit_Limit','Total_Revolving_Bal','Avg_Open_To_Buy','Total_Amt_Chng_Q4_Q1','Total_Trans_Amt','Total_Trans_Ct','Total_Ct_Chng_Q4_Q1','Avg_Utilization_Ratio']].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
Observations
- Mean and median value for `Customer_Age` is approx 46
- Mean and median value for `Months_on_book` is approx 36
- Mean and median value for `Dependent_count` is approx 2, although there are outliers
- Mean and median value for `Total_Relationship_Count` is approx 4, although there are outliers
- Outliers may be significant in `Months_Inactive_12_mon` and `Contacts_Count_12_mon`
- `Total_Revolving_Bal`, `Total_Amt_Chng_Q4_Q1`, `Total_Trans_Ct`, and `Total_Ct_Chng_Q4_Q1` all have many outliers
- Right-skewed data (mean > median):
  - `Credit_Limit`, where the mean is 8.6K and the median is 4.5K, with outliers
  - `Avg_Open_To_Buy`, where the mean is 7.4K and the median is 3.4K, with outliers
  - `Total_Trans_Amt`, where the mean is 4.4K and the median is 3.9K, with many outliers
  - `Avg_Utilization_Ratio`, where the mean is 0.27 and the median is 0.18
Number of unique values in each column
data.nunique()
CLIENTNUM 10127 Attrition_Flag 2 Customer_Age 45 Gender 2 Dependent_count 6 Education_Level 6 Marital_Status 3 Income_Category 6 Card_Category 4 Months_on_book 44 Total_Relationship_Count 6 Months_Inactive_12_mon 7 Contacts_Count_12_mon 7 Credit_Limit 6205 Total_Revolving_Bal 1974 Avg_Open_To_Buy 6813 Total_Amt_Chng_Q4_Q1 1158 Total_Trans_Amt 5033 Total_Trans_Ct 126 Total_Ct_Chng_Q4_Q1 830 Avg_Utilization_Ratio 964 dtype: int64
Observations
- Can drop the `CLIENTNUM` column as it is an ID variable and will not add value to the model
# Dropping columns from the dataframe
data.drop(columns=['CLIENTNUM'], inplace=True)
Number of observations in each category
cat_cols=['Attrition_Flag','Gender','Education_Level','Marital_Status','Income_Category','Card_Category']
for column in cat_cols:
print(data[column].value_counts())
print('-'*30)
Existing Customer 8500 Attrited Customer 1627 Name: Attrition_Flag, dtype: int64 ------------------------------ F 5358 M 4769 Name: Gender, dtype: int64 ------------------------------ Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64 ------------------------------ Married 4687 Single 3943 Divorced 748 Name: Marital_Status, dtype: int64 ------------------------------ Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: Income_Category, dtype: int64 ------------------------------ Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64 ------------------------------
Observations

- `Attrited Customer` represents about 16% of the customer base
- About 47% of the customer base is male
- There are six categories for `Education_Level`
- There are three categories for `Marital_Status`
- There is a nonsense category in `Income_Category`, with `abc` as data, representing about 11% of the rows
- There are very few `Platinum` customers
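Shares such as the attrition percentage can be read off directly with `value_counts(normalize=True)`; a sketch on a toy column (the proportions are illustrative):

```python
import pandas as pd

# Toy column with an 84/16 split, roughly like Attrition_Flag
flags = pd.Series(["Existing Customer"] * 84 + ["Attrited Customer"] * 16)
shares = flags.value_counts(normalize=True)
print(shares["Attrited Customer"])  # 0.16
```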
Make a copy of the dataset
df = data.copy()
# `df` is not used in this notebook, but is available for future use
Exploratory Data Analysis (EDA)
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions:
- How is the total transaction amount distributed?
- What is the distribution of the level of education of customers?
- What is the distribution of the level of income of customers?
- How does the change in transaction count between Q4 and Q1 (`Total_Ct_Chng_Q4_Q1`) vary by the customer's account status (`Attrition_Flag`)?
- How does the number of months a customer was inactive in the last 12 months (`Months_Inactive_12_mon`) vary by the customer's account status (`Attrition_Flag`)?
- What are the attributes that have a strong correlation with each other?
The below functions need to be defined to carry out the Exploratory Data Analysis.
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
plt.show()
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
### Univariate analysis
Compare the Attrition_Flag
Before going into the questions, let's view the proportion of each attrition class.
labeled_barplot(data,'Attrition_Flag',perc=True)
How is the total transaction amount distributed?
histogram_boxplot(data, 'Total_Trans_Amt')
Observations
- The data has four identifiable peaks
- The data is right skewed
What is the distribution of the level of education of customers?
labeled_barplot(data,'Education_Level',perc=True)
Observations
- The largest group is `Graduate`
- NOTE: there was missing data in the education category
What is the distribution of the level of income of customers?
labeled_barplot(data,'Income_Category',perc=True)
Observations
- A significant number have incomes less than $40K
- `abc` is a nonsense value
- The `$40K - $60K`, `$60K - $80K`, and `$80K - $120K` categories have roughly the same counts, about half that of `Less than $40K`
Bivariate analysis
How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
sns.set(rc={'figure.figsize':(21,7)})
sns.catplot(x="Attrition_Flag", y="Total_Ct_Chng_Q4_Q1", kind="boxen", data=data, height=7);
sns.swarmplot(data, x="Total_Ct_Chng_Q4_Q1", y="Attrition_Flag")
<Axes: xlabel='Total_Ct_Chng_Q4_Q1', ylabel='Attrition_Flag'>
sns.catplot(data, x="Total_Ct_Chng_Q4_Q1", y="Attrition_Flag", kind="violin")
<seaborn.axisgrid.FacetGrid at 0x2202f2d6310>
sns.histplot(data, x="Total_Ct_Chng_Q4_Q1", hue="Attrition_Flag", multiple="dodge", bins=30)
<Axes: xlabel='Total_Ct_Chng_Q4_Q1', ylabel='Count'>
Observation
As the charts show, attrited customers have a lower `Total_Ct_Chng_Q4_Q1`:

- The count is lower for `Attrited Customer`
- The mean is lower for `Attrited Customer`
How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
sns.set(rc={'figure.figsize':(21,7)})
sns.catplot(x="Attrition_Flag", y="Months_Inactive_12_mon", kind="boxen", data=data, height=7);
sns.histplot(data, x="Months_Inactive_12_mon", hue="Attrition_Flag", multiple="dodge")
<Axes: xlabel='Months_Inactive_12_mon', ylabel='Count'>
What are the attributes that have a strong correlation with each other?
sns.set(rc={'figure.figsize':(16,10)})
sns.heatmap(data.corr(numeric_only=True),
annot=True,
linewidths=.5,
center=0,
cbar=False,
cmap="Spectral")
plt.show()
Observations
Positive correlations

| Type | Range | Columns | Value | Description |
|---|---|---|---|---|
| Strong positive | r > 0.75 | `Avg_Open_To_Buy` and `Credit_Limit` | 1.00 | Essentially two ways of saying the same thing: how much credit is available |
| Strong positive | r > 0.75 | `Total_Trans_Amt` and `Total_Trans_Ct` | 0.81 | Total transaction amounts and counts are highly correlated |
| Strong positive | r > 0.75 | `Months_on_book` and `Customer_Age` | 0.79 | The older the customer, the longer they are likely to have been a customer (put another way, younger people have newer accounts) |
| Moderate positive | 0.40 < r < 0.75 | `Total_Revolving_Bal` and `Avg_Utilization_Ratio` | 0.62 | A higher balance lends itself to a higher ratio of use |
| Weak positive | 0 < r < 0.40 | `Total_Ct_Chng_Q4_Q1` and `Total_Amt_Chng_Q4_Q1` | 0.38 | Changes in quarterly count are weakly correlated with changes in amount, at least in the same direction |
Negative correlations

| Type | Range | Columns | Value | Description |
|---|---|---|---|---|
| Weak negative | -0.40 < r < 0 | `Total_Relationship_Count` and `Total_Trans_Ct` | -0.24 | Holding more bank products may mean a lower transaction count on the card (similar to the next item) |
| Weak negative | -0.40 < r < 0 | `Total_Relationship_Count` and `Total_Trans_Amt` | -0.35 | Holding more bank products may mean a lower transaction amount on the card |
| Moderate negative | -0.75 < r < -0.40 | `Credit_Limit` and `Avg_Utilization_Ratio` | -0.48 | A higher credit limit generally means using proportionally less of it |
| Moderate negative | -0.75 < r < -0.40 | `Avg_Open_To_Buy` and `Avg_Utilization_Ratio` | -0.54 | A higher open-to-buy amount generally means using proportionally less of the limit |
Additional analysis
The scatter plot matrix can help us visually identify significant patterns and groupings in the data.
# Scatter plot matrix
#num_features = ['Credit_Limit', 'Total_Trans_Amt', 'Months_on_book','Total_Ct_Chng_Q4_Q1']
num_features = ['Customer_Age','Dependent_count','Months_on_book','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon','Credit_Limit','Total_Revolving_Bal','Avg_Open_To_Buy','Total_Amt_Chng_Q4_Q1','Total_Trans_Amt','Total_Trans_Ct','Total_Ct_Chng_Q4_Q1','Avg_Utilization_Ratio']
sns.pairplot(data, vars=num_features, hue='Attrition_Flag', diag_kind='kde');
Observations
Key groupings of attrited and existing customers can be seen in the charts. In particular:

- `Total_Trans_Amt` and `Credit_Limit`
- `Months_on_book` and `Total_Trans_Amt`
- Every feature paired with `Total_Trans_Ct`, `Total_Trans_Amt`, or `Total_Ct_Chng_Q4_Q1`

Less separation can be seen in:

- `Months_on_book` and `Credit_Limit`

Along the diagonal, for many features:

- Attrited and existing customers have generally the same skewness
Data Pre-processing
Outlier detection and treatment
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Observations
- There are quite a few outliers in the data
- The values generally appear to be legitimate, so the outliers are left as-is
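The boxplot whiskers above use the 1.5 × IQR rule, so the same rule can quantify how many points fall outside them; a sketch on a toy series (not dataset values):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 10, 13, 50, 60])  # toy data with two high outliers

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # same fences as whis=1.5
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [50, 60]
```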
Remove index column
# Completed in an earlier step
# data.drop(columns=['CLIENTNUM'], inplace=True)
Removing duplicate columns
# Remove Avg_Open_To_Buy as it is perfectly correlated with Credit_Limit
# (Avg_Open_To_Buy = Credit_Limit - Total_Revolving_Bal)
data.drop(columns=['Avg_Open_To_Buy'], inplace=True)
Get a list of the object-type columns to convert into numeric values.
Do any columns remain with missing values?
data.isna().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Fix the nonsense abc data in Income_Category
Replace the abc values; here they are mapped to the $80K - $120K category (alternatively, they could be set to NaN and imputed along with the other missing values).
data['Income_Category'].value_counts()
Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: Income_Category, dtype: int64
data['Income_Category'] = data['Income_Category'].replace(['abc'], '$80K - $120K')
# sanity check for Income Category replacement
print(data['Income_Category'].value_counts())
Less than $40K 3561 $80K - $120K 2647 $40K - $60K 1790 $60K - $80K 1402 $120K + 727 Name: Income_Category, dtype: int64
# sanity check for Marital_Status values
print(data['Marital_Status'].value_counts())
Married 4687 Single 3943 Divorced 748 Name: Marital_Status, dtype: int64
Feature engineering
The following categorical columns contain missing data. So that their missing values can be imputed numerically (and decoded back to categories afterwards), they are label encoded first.
marital_status = {
'Married' : 0,
'Single' : 1,
'Divorced' : 2
}
data["Marital_Status"] = data["Marital_Status"].map(marital_status)
education_level = {
'Uneducated' : 0,
'High School': 1,
'Graduate': 2,
'College': 3,
'Post-Graduate': 4,
'Doctorate': 5
}
data["Education_Level"] = data["Education_Level"].map(education_level)
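One property this encoding relies on is that `Series.map` leaves unmapped values (including NaN) as NaN, so the missing entries survive for the imputation step; a quick sketch with made-up values:

```python
import numpy as np
import pandas as pd

education_level = {'Uneducated': 0, 'High School': 1, 'Graduate': 2,
                   'College': 3, 'Post-Graduate': 4, 'Doctorate': 5}

s = pd.Series(['Graduate', np.nan, 'Doctorate'])  # toy column with a missing value
encoded = s.map(education_level)
print(encoded.tolist())  # [2.0, nan, 5.0] -- NaN survives for later imputation
```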
Summary
data.head()
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | 1.000 | 0.000 | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | 2.000 | 1.000 | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | 2.000 | 0.000 | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | 1.000 | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | 0.000 | 0.000 | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
The columns with missing values have been converted to numeric codes, so they can be imputed.
Data preparation for model building
Separate features from the target column
The target column is Attrition_Flag.
It is encoded to numeric (1 = Attrited Customer, 0 = Existing Customer) as part of the separation below.
# Separating features and the target column
X = data.drop('Attrition_Flag', axis=1)
y = data['Attrition_Flag'].apply(lambda x: 1 if x == "Attrited Customer" else 0)
Split the data into train and test sets
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 18) (2026, 18) (2026, 18)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in validation data =", X_val.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 6075 Number of rows in validation data = 2026 Number of rows in test data = 2026
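The 6075/2026/2026 row counts reflect a 60/20/20 partition: the first split holds out 20% for test, and 0.25 of the remaining 80% is another 20% for validation. A sketch with synthetic data (sizes chosen for round numbers):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 800 + [1] * 200)  # imbalanced target, as in the churn data

# 20% held out for test, then 0.25 of the remaining 80% for validation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))     # 600 200 200
print(y_train.sum(), y_val.sum(), y_test.sum())  # 120 40 40 -- class ratio preserved
```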
Missing value treatment
df.isnull().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Fix Education_Level and Marital_Status missing values
import pandas as pd
from sklearn.impute import SimpleImputer
# Get list of categorical and numerical columns
cat_cols = list(X_train.select_dtypes(include='object').columns)
num_cols = list(X_train.select_dtypes(include=['int', 'float']).columns)
# Impute categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
X_val[cat_cols] = cat_imputer.transform(X_val[cat_cols])
X_test[cat_cols] = cat_imputer.transform(X_test[cat_cols])
# Impute numerical columns
num_imputer = SimpleImputer(strategy='mean')
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_val[num_cols] = num_imputer.transform(X_val[num_cols])
X_test[num_cols] = num_imputer.transform(X_test[num_cols])
# Checking that no column has missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Reverse Mapping for Encoded VariablesΒΆ
## Function to inverse the encoding
def inverse_mapping(x, y):
inv_dict = {v: k for k, v in x.items()}
X_train[y] = np.round(X_train[y]).map(inv_dict).astype("category")
X_val[y] = np.round(X_val[y]).map(inv_dict).astype("category")
X_test[y] = np.round(X_test[y]).map(inv_dict).astype("category")
inverse_mapping(marital_status, "Marital_Status")
inverse_mapping(education_level, "Education_Level")
The mappings replace the numeric codes with the original category values.
Train DatasetΒΆ
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_train[i].value_counts())
print("*" * 30)
Gender: F 3193, M 2882
Education_Level: Graduate 2782, High School 1228, Uneducated 881, College 618, Post-Graduate 312, Doctorate 254
Marital_Status: Single 2826, Married 2819, Divorced 430
Income_Category: Less than $40K 2129, $80K - $120K 1607, $40K - $60K 1059, $60K - $80K 831, $120K + 449
Card_Category: Blue 5655, Silver 339, Gold 69, Platinum 12
Validation DatasetΒΆ
cols = X_val.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_val[i].value_counts())
print("*" * 30)
Gender: F 1095, M 931
Education_Level: Graduate 917, High School 404, Uneducated 306, College 199, Post-Graduate 101, Doctorate 99
Marital_Status: Married 960, Single 910, Divorced 156
Income_Category: Less than $40K 736, $80K - $120K 514, $40K - $60K 361, $60K - $80K 279, $120K + 136
Card_Category: Blue 1905, Silver 97, Gold 21, Platinum 3
Test DatasetΒΆ
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_test[i].value_counts())
print("*" * 30)
Gender: F 1070, M 956
Education_Level: Graduate 948, High School 381, Uneducated 300, College 196, Post-Graduate 103, Doctorate 98
Marital_Status: Single 956, Married 908, Divorced 162
Income_Category: Less than $40K 696, $80K - $120K 526, $40K - $60K 370, $60K - $80K 292, $120K + 142
Card_Category: Blue 1876, Silver 119, Gold 26, Platinum 5
Creating Dummy VariablesΒΆ
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 28) (2026, 28) (2026, 28)
Observation
There are now 28 columns after creating the dummy variables; there were originally 20 columns (including the target).
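Here the three splits happen to produce identical column sets, but independent `get_dummies` calls can silently diverge when a rare category (e.g. Platinum cards) is absent from one split. A small sketch of guarding against that by reindexing the validation dummies to the training layout (toy data; the category values are illustrative):

```python
import pandas as pd

# Toy split where the validation slice is missing two categories (hypothetical)
train = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Gold"]})
val = pd.DataFrame({"Card_Category": ["Blue", "Blue"]})

train_d = pd.get_dummies(train, drop_first=True)
val_d = pd.get_dummies(val, drop_first=True)

# val_d ends up with no columns at all here; align it to the train layout
val_d = val_d.reindex(columns=train_d.columns, fill_value=0)
print(list(train_d.columns))  # ['Card_Category_Gold', 'Card_Category_Silver']
print(val_d.shape)            # (2, 2)
```

After the `reindex`, any dummy column unseen in the validation split exists as all zeros, so a model fit on the training columns can score the validation set without a shape mismatch.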
Model BuildingΒΆ
Model evaluation criterionΒΆ
The nature of predictions made by the classification model will translate as follows:
- True positives (TP) are attriting customers correctly identified by the model.
- False negatives (FN) are customers who attrite but whom the model predicts will stay.
- False positives (FP) are existing customers the model incorrectly flags as likely to attrite.
Which metric to optimize?
- We need to choose the metric which will ensure that the maximum number of attriting customers is predicted correctly by the model.
- We would want Recall to be maximized, as the greater the Recall, the higher the chances of minimizing false negatives.
- We want to minimize false negatives because a customer predicted to stay who actually leaves represents lost fee revenue and a missed chance at a retention offer.
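As a concrete check of how recall counts false negatives, a tiny sketch with toy labels (label 1 = attrited customer, 0 = existing customer; the values are made up):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy ground truth and predictions (hypothetical)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# sklearn orders the flattened binary confusion matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn)                        # 3 1
print(recall_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```

One missed attriter out of four drops recall to 0.75, which is why recall is the metric tracked throughout the model-building below.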
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Model Building with original dataΒΆ
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Decision tree", tree.DecisionTreeClassifier(random_state=1, class_weight='balanced')))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("GradientBoost", GradientBoostingClassifier(random_state=1)))
print("\nTraining and Validation Performance Difference:\n")
for name, model in models:
model.fit(X_train, y_train)
scores_train = recall_score(y_train, model.predict(X_train))
scores_val = recall_score(y_val, model.predict(X_val))
difference1 = scores_train - scores_val
print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference1))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9857, Validation Score: 0.8160, Difference: 0.1697
Random forest: Training Score: 1.0000, Validation Score: 0.8067, Difference: 0.1933
Decision tree: Training Score: 1.0000, Validation Score: 0.8190, Difference: 0.1810
AdaBoost: Training Score: 0.8453, Validation Score: 0.8589, Difference: -0.0136
GradientBoost: Training Score: 0.8760, Validation Score: 0.8681, Difference: 0.0079
Observation
The best model with the original data is GradientBoost, with only a 0.0079 difference between Training and Validation recall. It also posts strong recall on both splits, at roughly 87%.
Model Building with Oversampled dataΒΆ
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
Before Oversampling, counts of label 'Yes': 976
Before Oversampling, counts of label 'No': 5099
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
After Oversampling, counts of label 'Yes': 5099
After Oversampling, counts of label 'No': 5099
After Oversampling, the shape of train_X: (10198, 28)
After Oversampling, the shape of train_y: (10198,)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GradientBoost", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Decision tree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))
print("\nTraining and Validation Performance Difference:\n")
for name, model in models:
model.fit(X_train_over, y_train_over)
scores_train = recall_score(y_train_over, model.predict(X_train_over))
scores_val = recall_score(y_val, model.predict(X_val))
difference2 = scores_train - scores_val
print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference2))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9976, Validation Score: 0.8558, Difference: 0.1418
Random forest: Training Score: 1.0000, Validation Score: 0.8374, Difference: 0.1626
GradientBoost: Training Score: 0.9820, Validation Score: 0.8926, Difference: 0.0893
Adaboost: Training Score: 0.9651, Validation Score: 0.8681, Difference: 0.0970
Decision tree: Training Score: 1.0000, Validation Score: 0.8098, Difference: 0.1902
Observations
GradientBoost again performs best with the oversampled data, reaching a higher validation recall (0.8926) than it achieved on the original data.
Model Building with Undersampled dataΒΆ
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 976
Before Under Sampling, counts of label 'No': 5099
After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976
After Under Sampling, the shape of train_X: (1952, 28)
After Under Sampling, the shape of train_y: (1952,)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GradientBoost", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Decision tree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))
print("\nTraining and Validation Performance Difference:\n")
for name, model in models:
model.fit(X_train_un, y_train_un)
scores_train = recall_score(y_train_un, model.predict(X_train_un))
scores_val = recall_score(y_val, model.predict(X_val))
difference3 = scores_train - scores_val
print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference3))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9928, Validation Score: 0.9141, Difference: 0.0787
Random forest: Training Score: 1.0000, Validation Score: 0.9294, Difference: 0.0706
GradientBoost: Training Score: 0.9795, Validation Score: 0.9632, Difference: 0.0163
Adaboost: Training Score: 0.9539, Validation Score: 0.9693, Difference: -0.0154
Decision tree: Training Score: 1.0000, Validation Score: 0.9018, Difference: 0.0982
Observations
Adaboost and GBM have similar results when undersampling. GBM has a better Training score and a comparable Validation score.
Overall observationsΒΆ
After building the models, it was observed that both the GBM and Adaboost models, trained on an undersampled dataset, as well as the GBM model trained on an oversampled dataset, exhibited strong performance on both the training and validation datasets.
Models can overfit after undersampling or oversampling, so it is better to tune them to obtain a generalized performance.
We will tune the four best models using the same data (original or undersampled or oversampled) as we trained them on before:
- Original: GradientBoost: Training Score: 0.8760, Validation Score: 0.8681, Difference: 0.0079
- Oversampled: GradientBoost: Training Score: 0.9820, Validation Score: 0.8926, Difference: 0.0893
- Undersampled: GradientBoost: Training Score: 0.9795, Validation Score: 0.9632, Difference: 0.0163
- Undersampled: Adaboost: Training Score: 0.9539, Validation Score: 0.9693, Difference: -0.0154
Hyperparameter TuningΒΆ
Sample Parameter GridsΒΆ
Note
- Sample parameter grids have been provided to do necessary hyperparameter tuning. These sample grids are expected to provide a balance between model performance improvement and execution time. One can extend/reduce the parameter grid based on execution time and system configuration.
- Please note that if the parameter grid is extended to improve the model performance further, the execution time will increase
- For Gradient Boosting:
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
- For Adaboost:
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
- For Bagging Classifier:
param_grid = {
'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70],
}
- For Random Forest:
param_grid = {
"n_estimators": np.arange(50,110,25),
"min_samples_leaf": np.arange(1, 4),
"max_features": [0.3, 0.4, 0.5, 'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1)
}
- For Decision Trees:
param_grid = {
'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10, 15],
'min_impurity_decrease': [0.0001,0.001]
}
- For XGBoost (optional):
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
'''
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
'''
'''
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
'''
'''# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
'''
Gradient Boosting model with Original dataΒΆ
%%time
#Creating pipeline
Model = GradientBoostingClassifier(random_state=1)
#Parameter grid to pass in RandomSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8124803767660911:
CPU times: total: 1.72 s
Wall time: 24.9 s
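As an aside (not part of the original workflow): with `refit=True` (the default), `RandomizedSearchCV` already refits the best configuration on the full training set, so the tuned model can also be pulled straight from the search object instead of being re-instantiated by hand as in the next cell. A minimal sketch on toy data (the `_demo` names are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the training split (hypothetical)
X_demo, y_demo = make_classification(n_samples=200, random_state=1)

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions={"max_depth": [2, 3, 4]},
    n_iter=3,
    scoring="recall",
    cv=3,
    random_state=1,
)
search.fit(X_demo, y_demo)

# With refit=True (default) this estimator is already refit on all of X_demo, y_demo
best_model = search.best_estimator_
```

Re-instantiating the model with `best_params_`, as this notebook does, is equally valid and makes the chosen hyperparameters explicit in the code; `best_estimator_` simply saves the duplicate fit.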
tuned_gbm0 = GradientBoostingClassifier(
random_state=1,
subsample=0.9,
n_estimators=100,
max_features=0.5,
learning_rate=0.1,
init=AdaBoostClassifier(random_state=1),
)
tuned_gbm0.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
# Checking model's performance on training set
gbm0_train = model_performance_classification_sklearn(
tuned_gbm0, X_train, y_train
)
gbm0_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.949 | 0.984 | 0.766 | 0.861 |
# Checking model's performance on validation set
gbm0_val = model_performance_classification_sklearn(tuned_gbm0, X_val, y_val)
gbm0_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.943 | 0.966 | 0.750 | 0.845 |
Gradient Boosting model with Oversampled dataΒΆ
%%time
#Creating pipeline
Model = GradientBoostingClassifier(random_state=1)
#Parameter grid to pass in RandomSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8124803767660911:
CPU times: total: 1.95 s
Wall time: 13.8 s
tuned_gbm1 = GradientBoostingClassifier(
random_state=1,
subsample=0.9,
n_estimators=100,
max_features=0.5,
learning_rate=.1,
init=AdaBoostClassifier(random_state=1),
)
tuned_gbm1.fit(X_train_over, y_train_over)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
# Checking model's performance on training set
gbm1_train = model_performance_classification_sklearn(tuned_gbm1, X_train_over, y_train_over)
gbm1_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.980 | 0.981 | 0.978 | 0.980 |
# Checking model's performance on validation set
gbm1_val = model_performance_classification_sklearn(tuned_gbm1, X_val, y_val)
gbm1_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.960 | 0.893 | 0.864 | 0.878 |
Gradient Boosting model with Undersampled dataΒΆ
%%time
#Creating pipeline
Model = GradientBoostingClassifier(random_state=1)
#Parameter grid to pass in RandomSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 75, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9508320251177395:
CPU times: total: 734 ms
Wall time: 5.22 s
tuned_gbm2 = GradientBoostingClassifier(
random_state=1,
subsample=0.9,
n_estimators=75,
max_features=0.7,
learning_rate=0.1,
init=AdaBoostClassifier(random_state=1),
)
tuned_gbm2.fit(X_train_un, y_train_un)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.7, n_estimators=75, random_state=1,
                           subsample=0.9)
# Checking model's performance on training set
gbm2_train = model_performance_classification_sklearn(
tuned_gbm2, X_train_un, y_train_un
)
gbm2_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.971 | 0.977 | 0.966 | 0.971 |
# Checking model's performance on validation set
gbm2_val = model_performance_classification_sklearn(tuned_gbm2, X_val, y_val)
gbm2_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.936 | 0.954 | 0.732 | 0.828 |
AdaBoost model with Undersampled dataΒΆ
%%time
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": np.arange(10, 40, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 30, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.9375039246467818:
CPU times: total: 438 ms
Wall time: 2.23 s
tuned_adb = AdaBoostClassifier(
random_state=1,
n_estimators=30,
learning_rate=1,
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
)
tuned_adb.fit(X_train_un, y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
                                                         random_state=1),
                   learning_rate=1, n_estimators=30, random_state=1)
# Checking model's performance on training set
adb_train = model_performance_classification_sklearn(tuned_adb, X_train_un, y_train_un)
adb_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.970 | 0.975 | 0.965 | 0.970 |
# Checking model's performance on validation set
adb_val = model_performance_classification_sklearn(tuned_adb, X_val, y_val)
adb_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.932 | 0.966 | 0.714 | 0.821 |
Model Comparison and Final Model SelectionΒΆ
# training performance comparison
models_train_comp_df = pd.concat(
[
gbm0_train.T,
gbm1_train.T,
gbm2_train.T,
adb_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Gradient boosting trained with Original data",
"Gradient boosting trained with Oversampled data",
"Gradient boosting trained with Undersampled data",
"AdaBoost trained with Undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Gradient boosting trained with Original data | Gradient boosting trained with Oversampled data | Gradient boosting trained with Undersampled data | AdaBoost trained with Undersampled data | |
|---|---|---|---|---|
| Accuracy | 0.949 | 0.980 | 0.971 | 0.970 |
| Recall | 0.984 | 0.981 | 0.977 | 0.975 |
| Precision | 0.766 | 0.978 | 0.966 | 0.965 |
| F1 | 0.861 | 0.980 | 0.971 | 0.970 |
# Validation performance comparison
models_train_comp_df = pd.concat(
[ gbm0_val.T, gbm1_val.T, gbm2_val.T, adb_val.T], axis=1,
)
models_train_comp_df.columns = [
"Gradient boosting trained with Original data",
"Gradient boosting trained with Oversampled data",
"Gradient boosting trained with Undersampled data",
"AdaBoost trained with Undersampled data",
]
print("Validation performance comparison:")
models_train_comp_df
Validation performance comparison:
| Gradient boosting trained with Original data | Gradient boosting trained with Oversampled data | Gradient boosting trained with Undersampled data | AdaBoost trained with Undersampled data | |
|---|---|---|---|---|
| Accuracy | 0.943 | 0.960 | 0.936 | 0.932 |
| Recall | 0.966 | 0.893 | 0.954 | 0.966 |
| Precision | 0.750 | 0.864 | 0.732 | 0.714 |
| F1 | 0.845 | 0.878 | 0.828 | 0.821 |
Observations
- Gradient boosting trained with Original data and AdaBoost trained with Undersampled data had similar Recall values.
- Gradient boosting trained with Original data had slightly better accuracy.
Test set final performanceΒΆ
# Let's check `Gradient boosting trained with Original data` on the test set
gbm0_test = model_performance_classification_sklearn(tuned_gbm0, X_test, y_test)
gbm0_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.938 | 0.969 | 0.731 | 0.833 |
confusion_matrix_sklearn(tuned_gbm0, X_test, y_test)
Let's also check AdaBoost trained with Undersampled data against the test data for comparison.
# Let's check the AdaBoost trained with Undersampled data performance on test set
ada_test = model_performance_classification_sklearn(tuned_adb, X_test, y_test)
ada_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.925 | 0.957 | 0.693 | 0.804 |
confusion_matrix_sklearn(tuned_adb, X_test, y_test)
Observation
- Gradient boosting trained with Original data performed best against the test data.
- This performance is in line with what we achieved with this model on the train and validation sets.
- Gradient boosting trained with Original data is therefore a well-generalized model.
Feature importanceΒΆ
feature_names = X_train.columns
importances = tuned_gbm0.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations
Total_Trans_Ct, Total_Trans_Amt, Total_Revolving_Bal, Total_Ct_Chng_Q4_Q1, and Total_Amt_Chng_Q4_Q1 were the most significant features for making predictions.
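Impurity-based `feature_importances_` (plotted above) can overstate high-cardinality or continuous features; permutation importance on held-out data is a common model-agnostic cross-check. A sketch on synthetic stand-ins for the tuned model and validation split (the data and column names here are hypothetical):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-ins for tuned_gbm0 / X_val / y_val (hypothetical)
X_arr, y_arr = make_classification(n_samples=300, n_features=6, random_state=1)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
clf = GradientBoostingClassifier(random_state=1).fit(X_demo, y_arr)

# Mean recall drop when each column is shuffled, using the project's metric
result = permutation_importance(
    clf, X_demo, y_arr, scoring="recall", n_repeats=5, random_state=1
)
ranking = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(
    ascending=False
)
print(ranking.head(3))
```

If the permutation ranking broadly agrees with the impurity-based ranking, as it typically would for strong transaction features like Total_Trans_Ct, that adds confidence to the business insights drawn below.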
Business Insights and ConclusionsΒΆ
Observations
By reviewing the model, we can learn that the following factors are important in credit card attrition:
- The best predictors of whether a customer keeps their credit card are current usage: transaction counts, transaction amounts, and revolving balance.
- The next strongest indicators are the quarter-over-quarter changes in usage, in both transaction amounts and transaction counts.
- The third-highest indicators are utilization and the number of times the customer contacts or meets with the bank.
Recommendations
While keeping the risk to the bank in mind, the bank could become more aggressive with:
- Balance-transfer incentives, which may lead to higher utilization and a higher revolving balance, both key indicators for attrition.
- Cash-back incentives for card usage.
- Incentives for the customer to add another product, such as a personal loan, even if it isn't a credit card; the idea is to deepen the customer relationship (raising Total_Relationship_Count) and cross-sell services.
- Availability and preferred-card status at online providers, such as SHOP or PAYPAL, to increase transaction counts.
- "Pay in 4" installment offers that charge the credit card over time, as an incentive to increase transaction amounts.
The economy and interest-rate changes were not included in this study and may be significant sources of attrition. The study also does not capture customers' stated reasons for leaving, so it offers no direct insight into why customers no longer want our bank's credit cards.